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1 ABSTRACT 

Emerging technology is enabling the design community to consistently expand the amount 
of functionality that can be implemented within Integrated Circuits (ICs). As the number of gates 
placed within an FPGA increases, the complexity of the design can grow exponentially. 
Consequently, the ability to create reliable circuits has become an incredibly difficult task. In 
order to ease the complexity of design completion, the commercial design community has 
developed a very rigid (but effective) design methodology based on synchronous circuit 
techniques. 

In order to create faster, smaller and lower power circuits, transistor geometries and core 
voltages have decreased. In environments that contain ionizing energy, such a combination will 
increase the probability of Single Event Upsets (SEUs) and will consequently affect the state 
space of a circuit. In order to combat the effects of radiation, the aerospace community has 
developed several “Hardened by Design” (fault tolerant) design schemes. 

This paper will address design mitigation schemes targeted for SRAM Based FPGA CMOS 
devices. Because some mitigation schemes may be over zealous (too much power, area, 
complexity, etc.. . .), the designer should be conscious that system requirements can ease the 
amount of mitigation necessary for acceptable operation. Therefore, various degrees of Fault 
Tolerance will be demonstrated along with an analysis of its effectiveness. 

2 SYNCHRONOUS DESIGN AND SOFT ERRORS WITHIN CMOS TECHNOLOGY 

The foundation of Synchronous Methodology is to assure deterministic behavior. One of the 
major key components of synchronous design techniques is employing clocks as a source of 
control. Designs are created such that portions of functionality can be completed within one 
clock cycle. This rule of thumb establishes a discrete means of verification, determinism, and 
traceability. 

Due to the reduction in core voltage, decrease in transistor geometry, and increase in 
switching speeds, CMOS transistors have become more susceptible to incurring faults. Upsets 
can be in the form of a flipped bit (DFF or memory) or a transient (glitch) within a combinatorial 
logic path (i.e. functional logic, clock, reset, receive buffers, transmit buffers, etc. . ..). Soft 
Flipped bit faults are labeled as Single Event Upsets (SEUs). Glitches within combinatorial 
circuitry are identified as Single Event Transients (SETs). Terrestrial devices have fault 
vulnerability mostly due to: alpha particles (from packaging and doping) and Neutrons (caused by 
Cosmic Ray Interactions that enter into the earth’s atmosphere). Designs targeted for higher 
altitude operations (Aerospace and Military) are more prone to upsets caused by heavy ions and 
protons. 

What is of major concern to a designer is that these upsets are unpredictable in occurrence 
and therefore asynchronous in nature. A great deal of research has been done concerning 
memory and DFF bit upsets and their associated fault tolerant solutions (generally assumes 
synchronous fault generation and synchronous correction). However, as system clock speeds 
increase, the probability of capturing combinatorial transients also increases. This error cross 
section can be greater than the probability of SEU error rates at high speed operation. The major 
caveat is that due to the asynchronous characteristics of the transients, metastability can occur and 
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cause major system malfunction. Unfortunately, finding a 100% effective solution is impossible 
when protecting circuitry from asynchronous events. However, the designer can reduce the 
probability of error by creating “clean” synchronous designs while strategically choosing fault 
tolerant mitigation techniques. 

Careful analysis of the points of failure (within the combinatorial paths and sequential 
components) must be performed in order to determine if the chosen error correction (or detection) 
scheme is beneficial or has created a more susceptible design. The analysis includes: definition 
of SEU and SET error cross sections, knowledge of maximum transient widths (based on process, 
implementation and operating environment), point of failure identification, probability analysis of 
multiple failures upon SEU/SET generation (due to fan - out and/or metastability), and 
probability of SET propagation (can be depleted on high capacitance nets or gates). 

3 FAULT TOLERANCE AND SRAM BASED FPGAS 

The definition of Fault Tolerance is the ability to mask or recover from erroneous conditions 
in a system once an error has been detected. The degree of fault tolerance implementation is 
defined by your system level requirements. . . i.e. specifications that clearly state acceptable 
behavior upon error. 

3.1 SRAM Based FPGAs and Aerospace Radiation Environments 

Due to the higher SEU and SET error rates within Aerospace and Military environments, 
the level of necessary fault tolerance can be very high. Proposed schemes for SRAM based 
FPGA designs can be very complex and almost impossible to implement. A maximally fault 
tolerant system (for SRAM based technology) would require: A scrubbing circuit for the SRAM 
configuration space (usually implemented within an anti-fuse FPGA technology), redundant 
memory (containing the configuration bit-stream), and a distributed logic voting scheme (tripling 
the design including I/O, clocks, resets, and all functional logic paths). Such an implementation 
can lead to extremely complex board designs with signal integrity issues and to unverifiable 
designs (very dangerous). Due to the lack of verification integrity and the extreme increase in 
complexity and cost, designers are investigating partial mitigation techniques with SRAM Based 
FPGAs (when acceptable to project specifications). 
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Feedback path will only be latched 
upon an active enable. Therefore, 
corrected data will not register 
until the next enable cycle 


Figure Is Internal Implementation of Distributed Triple Mode Redundancy (TMR) 
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3.2 SRAM Based FPGAs and Terrestrial Environments 

For terrestrial environments, requirements are more relaxed. The system may only necessitate 
scrubbing. However, partial mitigation strategies (for the functional logic) may also be 
necessary. A very popular example of partial mitigation is Error Detection and Correction 
circuitry. The point that is incredibly overlooked is that due to the ever increasing SET rates (due 
to the faster logic and faster clock frequencies), current ED AC implementations may increase the 
fault cross section rather than decrease it. As demonstrated in Figure 2, the EDAC circuitry will 
more likely have a higher probability of SET generation due to its transistor level complexity. At 
high operational frequencies, a SET within the feedback path can get caught by Several DFFs and 
cause uncorrectable results or metastability. An alternate approach can be to triple the EDAC 
logic, vote, and then feedback. The error cross section is now decreased from a higher 
probabilistic SET cloud of logic to a very small cross section consisting of the transistors that 
comprise the voter logic. 


Common Approach Ignoring Asynchronous Safer Mitigation Approach 

SET Generation and Repercussion 



Figure 2: Simplified Depiction of EDAC Circuitry and Proper Mitigation Techniques Concerning 
Asynchronous Fault Generation. 


4 SUMMARY 

CMOS technological improvements have lead to transistor level vulnerability to ionized particles. 
Based on the system level requirements and the operational environment, careful analysis of the 
proposed fault tolerant implementation must be performed. There exist mitigation strategies that 
can be overly complex and very costly. It is the designers’ responsibility to perform a detailed 
trade space of the necessary level of redundancy vs. its expense of implementation. This paper 
investigates types of fault tolerant schemes, and their effectiveness towards reaching given 
specification objectives. 


"Fault Tolerance Implementation within SRAM Based FPGA Designs based upon Single Event Upset 
Occurrence Rates" to be presented by Melanie Berg at the 12th IEEE International On-Line Testing 
Symposium, Grand Hotel di Como, Lake of Como, Italy, July 10-12, 2006 


3 








